Does language help generalization in vision models? (arXiv:2104.08313v3 [cs.AI])
Training vision models on multimodal data lets them exploit the wide
availability of large image-caption datasets. A recent model (CLIP) was found
to generalize well in zero-shot and transfer learning settings. This could
imply that linguistic or "semantic grounding" confers additional generalization
abilities to the visual feature space. Here, we systematically evaluate various
multimodal architectures and vision-only models in terms of unsupervised
clustering, few-shot learning, transfer learning and adversarial robustness. In
each setting, multimodal training produced no additional generalization
capability compared to standard supervised visual training. We conclude that
further work is still required before semantic grounding can be shown to
improve vision models.
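For context, the zero-shot setting referenced above works by scoring an image against caption-style text prompts. The sketch below shows this with the public CLIP checkpoint via the Hugging Face transformers library; it is an illustrative example, not the paper's evaluation code, and the image path and label set are placeholders.

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

# Load the public CLIP checkpoint (ViT-B/32).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Candidate classes rendered as caption-style prompts.
labels = ["cat", "dog", "car"]
prompts = [f"a photo of a {label}" for label in labels]

image = Image.open("example.jpg")  # placeholder: any RGB image

inputs = processor(text=prompts, images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores;
# softmax over prompts yields zero-shot class probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```

The same frozen image encoder can be compared against a vision-only backbone by swapping the model and fitting a linear probe on its features, which is the usual transfer-learning protocol in this line of work.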